kvserver: deflake TestFollowerReadsWithStaleDescriptor #62833

Conversation

@aayushshah15 (Contributor) commented Mar 30, 2021

A preceding change (#62696) introduced a flaky update to this test.
Prior to that change, this test used 2 voting replicas, but that
change made it use 1 voter and 1 non-voter instead (as a litmus
test for the new syntax added in #62696).

The test rebalances a replica away from a node and ensures that a
historical read sent immediately afterwards gets re-routed to the
leaseholder replica, since the receiving store had its replica
destroyed. However, when the test uses a non-voter, that non-voter
may not have learned about the replication change by the time it
receives the historical query, and that fails the assertion.

This commit re-organizes the test and fixes the flake.

Release note: None
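
For context, the historical read the test issues has roughly this shape (a minimal sketch, not the test's actual code; the package name, table name, and the use of follower_read_timestamp() are assumptions here):

package followerreads_test

import gosql "database/sql"

// runHistoricalRead is a hypothetical helper showing the shape of the query:
// a stale read at a follower-read-eligible timestamp, sent through one node's
// SQL connection. If that node's store no longer holds a replica for the
// range, the request has to be re-routed to the leaseholder, which is what
// the test asserts.
func runHistoricalRead(db *gosql.DB) error {
	_, err := db.Exec(`SELECT * FROM test AS OF SYSTEM TIME follower_read_timestamp()`)
	return err
}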

@cockroach-teamcity (Member)

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the 20210330_deflakeTestFollowerReadsWithStaleDescriptor branch 2 times, most recently from cbf6d8d to 5afab80 on March 31, 2021 06:11
@andreimatei (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/ccl/kvccl/kvfollowerreadsccl/followerreads_test.go, line 506 at r1 (raw file):

	var n2Addr, n3Addr syncutil.AtomicString
	waitForSnapApplication := make(chan struct{})

nit: name it snapApplication (or better yet snapApplied); that's what the channel carries. The fact that one of the sides uses it to "wait" for the signal is generally implied by channels.


pkg/ccl/kvccl/kvfollowerreadsccl/followerreads_test.go, line 519 at r1 (raw file):

							// `waitForSnapApplication` to signal that.
							AfterApplySnapshot: func(inSnap *kvserver.IncomingSnapshot) {
								if inSnap.SnapType == kvserver.SnapshotRequest_VIA_SNAPSHOT_QUEUE {

should we check for the right node id and range id before signaling?


pkg/ccl/kvccl/kvfollowerreadsccl/followerreads_test.go, line 601 at r1 (raw file):

	// do.
	select {
	case <-waitForSnapApplication:

instead of this new knob, could we just poll the store on n3 until it gets the desired replica? I would probably prefer that, since waiting for a particular snapshot seems only indirectly related to what we really want - which is for n3 to get the replica. And you also had to add a lot of words about the "snapshot queue", which this test doesn't feel like it needs to know about.
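
A rough sketch of that polling alternative (the helper name and its arguments are assumptions of this sketch; SucceedsSoon and LookupReplica are existing utilities in the tree):

package followerreads_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/kv/kvserver"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/testutils"
	"github.com/cockroachdb/errors"
)

// waitForReplica is a hypothetical helper: instead of hooking snapshot
// application via a testing knob, retry until the given store reports a
// replica for the range starting at startKey.
func waitForReplica(t *testing.T, store *kvserver.Store, startKey roachpb.RKey) {
	testutils.SucceedsSoon(t, func() error {
		if repl := store.LookupReplica(startKey); repl == nil {
			return errors.Errorf("store %d does not have a replica yet", store.StoreID())
		}
		return nil
	})
}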


pkg/kv/kvserver/testing_knobs.go, line 195 at r1 (raw file):

	// AfterApplySnapshot is run after a snapshot is successfully applied by its
	// recipient.
	AfterApplySnapshot func(*IncomingSnapshot)

Does the arg need to be a pointer? We seem to pass that struct by value elsewhere.
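
For comparison, the by-value shape being asked about would look like this (only the knob field is shown, as a sketch; whether the struct is cheap enough to copy is the open question):

	// AfterApplySnapshot is run after a snapshot is successfully applied by its
	// recipient. By-value variant of the knob.
	AfterApplySnapshot func(inSnap IncomingSnapshot)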

@aayushshah15 aayushshah15 force-pushed the 20210330_deflakeTestFollowerReadsWithStaleDescriptor branch 2 times, most recently from 1f5d57e to 6fb9e96 on March 31, 2021 17:55
@aayushshah15 (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/ccl/kvccl/kvfollowerreadsccl/followerreads_test.go, line 601 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

instead of this new knob, could we just poll the store on n3 until it gets the desired replica? I would probably prefer that, since waiting for a particular snapshot seems only indirectly related to what we really want - which is for n3 to get the replica. And you also had to add a lot of words about the "snapshot queue", which this test doesn't feel like it needs to know about.

Changed it to just poll n3 for the non-voter, see now.

@aayushshah15 aayushshah15 force-pushed the 20210330_deflakeTestFollowerReadsWithStaleDescriptor branch from 6fb9e96 to 04c09fe on March 31, 2021 18:26
@andreimatei (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)

@aayushshah15 (Contributor, Author)

thanks for the review!

bors r+

@craig (craig bot) commented Mar 31, 2021

Build succeeded:

@andreimatei (Contributor)

hey @aayushshah15, did this fix #54494? The last failure there was on the same day this merged, but before that one, the other failures are much older and used to happen every day - which seems weird.
